4,875 research outputs found

    Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science

    Full text link
    As the field of data science continues to grow, there will be an ever-increasing demand for tools that make machine learning accessible to non-experts. In this paper, we introduce the concept of tree-based pipeline optimization for automating one of the most tedious parts of machine learning: pipeline design. We implement an open source Tree-based Pipeline Optimization Tool (TPOT) in Python and demonstrate its effectiveness on a series of simulated and real-world benchmark data sets. In particular, we show that TPOT can design machine learning pipelines that provide a significant improvement over a basic machine learning analysis while requiring little to no input or prior knowledge from the user. We also address the tendency of TPOT to design overly complex pipelines by integrating Pareto optimization, which produces compact pipelines without sacrificing classification accuracy. As such, this work represents an important step toward fully automating machine learning pipeline design.
    Comment: 8 pages, 5 figures; preprint to appear in GECCO 2016; edits from reviewer comments not yet made
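
    TPOT is distributed as a Python package, so the workflow described above can be tried directly. The sketch below is a minimal, illustrative usage example: the data set, parameter values and train/test split are arbitrary choices, and the packaged API may differ in detail from the preprint version.

        # Minimal TPOT usage sketch (illustrative settings only).
        from sklearn.datasets import load_digits
        from sklearn.model_selection import train_test_split
        from tpot import TPOTClassifier

        X, y = load_digits(return_X_y=True)
        X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

        # Evolve pipelines for a few generations; TPOT searches over
        # preprocessors, feature selectors and classifiers, and its Pareto
        # optimization favours compact pipelines.
        tpot = TPOTClassifier(generations=5, population_size=20,
                              verbosity=2, random_state=0)
        tpot.fit(X_train, y_train)
        print(tpot.score(X_test, y_test))

        # Export the best pipeline found as a standalone Python script.
        tpot.export('best_pipeline.py')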

    Variable Selection and Model Averaging in Semiparametric Overdispersed Generalized Linear Models

    Full text link
    We express the mean and variance terms in a double exponential regression model as additive functions of the predictors and use Bayesian variable selection to determine which predictors enter the model, and whether they enter linearly or flexibly. When the variance term is null we obtain a generalized additive model, which becomes a generalized linear model if the predictors enter the mean linearly. The model is estimated using Markov chain Monte Carlo simulation, and the methodology is illustrated using real and simulated data sets.
    Comment: 8 graphs, 35 pages
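
    The variable selection step can be illustrated with a far simpler toy model than the paper's. The sketch below runs a Metropolis sampler over inclusion indicators for an ordinary Gaussian linear model under a Zellner g-prior; it is not the authors' double exponential or semiparametric model, and all data and settings are invented for illustration.

        # Toy Bayesian variable selection by MCMC over inclusion indicators
        # (Gaussian linear model, Zellner g-prior; synthetic data).
        import numpy as np

        rng = np.random.default_rng(0)
        n, p = 200, 8
        X = rng.standard_normal((n, p))
        y = 1.5 * X[:, 0] - 2.0 * X[:, 3] + rng.standard_normal(n)

        Xc = X - X.mean(axis=0)   # centre out the intercept
        yc = y - y.mean()
        g = float(n)              # unit-information g-prior

        def log_marginal(gamma):
            # Log Bayes factor of the model using predictors `gamma`
            # against the null (intercept-only) model.
            k = int(gamma.sum())
            if k == 0:
                return 0.0
            Z = Xc[:, gamma]
            coef, *_ = np.linalg.lstsq(Z, yc, rcond=None)
            r2 = 1.0 - np.sum((yc - Z @ coef) ** 2) / (yc @ yc)
            return 0.5 * (n - 1 - k) * np.log1p(g) - 0.5 * (n - 1) * np.log1p(g * (1.0 - r2))

        gamma = np.zeros(p, dtype=bool)
        current = log_marginal(gamma)
        incl = np.zeros(p)
        n_iter = 5000
        for _ in range(n_iter):
            j = rng.integers(p)
            prop = gamma.copy()
            prop[j] = not prop[j]                       # flip one indicator
            cand = log_marginal(prop)
            if np.log(rng.random()) < cand - current:   # uniform prior over models
                gamma, current = prop, cand
            incl += gamma
        print("posterior inclusion probabilities:", np.round(incl / n_iter, 2))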

    Machine Learning for Quantum Mechanical Properties of Atoms in Molecules

    Get PDF
    We introduce machine learning models of quantum mechanical observables of atoms in molecules. Instant out-of-sample predictions for proton and carbon nuclear chemical shifts, atomic core level excitations, and forces on atoms reach accuracies on par with the density functional theory reference. Locality is exploited within non-linear regression via local atom-centered coordinate systems. The approach is validated on a diverse set of 9k small organic molecules. Linear scaling of computational cost in system size is demonstrated for saturated polymers with up to sub-mesoscale lengths.
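
    As a rough illustration of the regression step alone, the sketch below fits kernel ridge regression with a Laplacian kernel to placeholder descriptor vectors. The paper's actual atom-centred representations, targets and hyperparameters are not reproduced here; everything below is synthetic.

        # Generic non-linear regression step: Laplacian-kernel ridge on
        # stand-in per-atom descriptors (not the paper's representation).
        import numpy as np
        from sklearn.kernel_ridge import KernelRidge
        from sklearn.model_selection import train_test_split

        rng = np.random.default_rng(0)
        D = rng.standard_normal((500, 30))                    # stand-in atomic environment descriptors
        t = np.sin(D[:, 0]) + 0.1 * rng.standard_normal(500)  # stand-in chemical shifts

        D_tr, D_te, t_tr, t_te = train_test_split(D, t, random_state=0)

        # alpha and gamma would normally be chosen by cross-validation.
        model = KernelRidge(kernel="laplacian", alpha=1e-3, gamma=0.05)
        model.fit(D_tr, t_tr)
        print("test MAE:", np.abs(model.predict(D_te) - t_te).mean())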

    A Bayesian spatio-temporal model of panel design data: airborne particle number concentration in Brisbane, Australia

    Get PDF
    This paper outlines a methodology for semi-parametric spatio-temporal modelling of data which is dense in time but sparse in space, obtained from a split panel design, the most feasible approach to covering space and time with limited equipment. The data are hourly averaged particle number concentration (PNC) and were collected as part of the Ultrafine Particles from Transport Emissions and Child Health (UPTECH) project. Two weeks of continuous measurements were taken at each of a number of government primary schools in the Brisbane Metropolitan Area, with the monitoring equipment moved from school to school sequentially. The school data are augmented by data from long-term monitoring stations at three locations in Brisbane, Australia. Fitting the model helps describe the spatial and temporal variability at a subset of the UPTECH schools and the long-term monitoring sites. The temporal variation is modelled hierarchically with penalised random walk terms: one common to all sites, and a term accounting for the remaining temporal trend at each site. Parameter estimates and their uncertainty are computed in a computationally efficient approximate Bayesian inference environment, R-INLA. The temporal part of the model explains daily and weekly cycles in PNC at the schools, which can be used to estimate the exposure of school children to ultrafine particles (UFPs) emitted by vehicles. At each school and long-term monitoring site, peaks in PNC can be attributed to morning and afternoon rush hour traffic and to new particle formation events. The spatial component of the model describes the school-to-school variation in mean PNC and the variation within each school ground. It is shown how the spatial model can be expanded to identify spatial patterns at the city scale with the inclusion of more spatial locations.
    Comment: A draft of this paper was presented as a poster at ISBA 2012; part of the UPTECH project
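
    The full hierarchical model is fitted with R-INLA in the paper; the numpy sketch below illustrates just one ingredient, a first-order random walk (RW1) trend estimated by penalized least squares on synthetic hourly data, with a hand-picked smoothing parameter in place of full Bayesian inference.

        # Toy penalized random-walk (RW1) smoother for an hourly series.
        import numpy as np

        rng = np.random.default_rng(0)
        hours = np.arange(24 * 14)                     # two weeks of hourly data
        signal = 2.0 + np.sin(2 * np.pi * hours / 24)  # stand-in daily PNC cycle
        y = signal + 0.5 * rng.standard_normal(hours.size)

        n = hours.size
        D = np.diff(np.eye(n), axis=0)                 # first-difference penalty matrix
        lam = 50.0                                     # smoothing parameter (hand-picked here)

        # Penalized least squares: minimize ||y - f||^2 + lam * ||D f||^2.
        f = np.linalg.solve(np.eye(n) + lam * D.T @ D, y)
        print("residual sd:", np.std(y - f).round(3))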

    High-Dimensional Inference with the generalized Hopfield Model: Principal Component Analysis and Corrections

    Get PDF
    We consider the problem of inferring the interactions between a set of N binary variables from the knowledge of their frequencies and pairwise correlations. The inference framework is based on the Hopfield model, a special case of the Ising model where the interaction matrix is defined through a set of patterns in the variable space and is of rank much smaller than N. We show that Maximum Likelihood inference is deeply related to Principal Component Analysis when the amplitude of the pattern components, xi, is negligible compared to N^(1/2). Using techniques from statistical mechanics, we calculate the corrections to the patterns to first order in xi/N^(1/2). We stress that it is important to generalize the Hopfield model and include both attractive and repulsive patterns in order to correctly infer networks with sparse and strong interactions. We present a simple geometrical criterion for deciding how many attractive and repulsive patterns should be considered as a function of the sampling noise. We moreover discuss how many sampled configurations are required for a good inference, as a function of the system size, N, and of the amplitude, xi. The inference approach is illustrated on synthetic and biological data.
    Comment: Physical Review E: Statistical, Nonlinear, and Soft Matter Physics (2011), to appear
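
    At lowest order, the inference reduces to PCA of the sample correlation matrix. The sketch below extracts candidate attractive and repulsive patterns from synthetic binary data, using the Marchenko-Pastur bulk edges as a crude stand-in for the paper's geometrical criterion; the first-order corrections in xi/N^(1/2) are not implemented.

        # Zeroth-order pattern inference by PCA of the correlation matrix.
        import numpy as np

        rng = np.random.default_rng(0)
        B, N = 2000, 50                                # B configurations of N binary variables
        S = np.where(rng.random((B, N)) < 0.5, 1, -1)  # stand-in data; real samples go here

        C = np.corrcoef(S, rowvar=False)               # pairwise correlation matrix
        eigval, eigvec = np.linalg.eigh(C)             # eigenvalues in ascending order

        # Marchenko-Pastur edges as a simple sampling-noise cutoff for how
        # many patterns to keep.
        r = N / B
        lo, hi = (1 - r ** 0.5) ** 2, (1 + r ** 0.5) ** 2
        attractive = eigvec[:, eigval > hi]            # candidate attractive patterns
        repulsive = eigvec[:, eigval < lo]             # candidate repulsive patterns
        print(attractive.shape[1], "attractive,", repulsive.shape[1], "repulsive")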

    Varying-coefficient modeling via regularized basis functions

    Full text link
    We address the problem of constructing varying-coefficient models based on basis expansions combined with the technique of regularization. A crucial point in our modeling procedure is the selection of smoothing parameters in the regularization method. In order to choose the parameters objectively, we derive model selection criteria from information-theoretic and Bayesian viewpoints. We demonstrate the effectiveness of the proposed modeling strategy through Monte Carlo simulations and the analysis of a real data set.
    Comment: 10 pages, 4 figures
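
    A toy version of the construction: the varying coefficient beta(t) is expanded in a small Gaussian radial basis (a stand-in for the basis functions of the paper) and estimated by ridge-regularized least squares, with the smoothing parameter fixed by hand rather than chosen by the derived criteria.

        # Toy varying-coefficient model y = x * beta(t) + noise.
        import numpy as np

        rng = np.random.default_rng(0)
        n = 300
        t = np.sort(rng.uniform(0, 1, n))  # effect modifier
        x = rng.standard_normal(n)         # predictor
        beta_true = np.sin(2 * np.pi * t)  # true varying coefficient
        y = x * beta_true + 0.3 * rng.standard_normal(n)

        centers = np.linspace(0, 1, 12)
        Phi = np.exp(-0.5 * ((t[:, None] - centers) / 0.08) ** 2)  # basis in t
        Z = x[:, None] * Phi               # varying-coefficient design matrix

        lam = 1.0                          # smoothing parameter (fixed here)
        b = np.linalg.solve(Z.T @ Z + lam * np.eye(centers.size), Z.T @ y)
        beta_hat = Phi @ b                 # estimated beta(t)
        print("max abs error:", np.abs(beta_hat - beta_true).max().round(2))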

    Fast stable direct fitting and smoothness selection for Generalized Additive Models

    Get PDF
    Existing computationally efficient methods for penalized likelihood GAM fitting employ iterative smoothness selection on working linear models (or working mixed models). Such schemes fail to converge for a non-negligible proportion of models, with failure being particularly frequent in the presence of concurvity. If smoothness selection is performed by optimizing 'whole model' criteria these problems disappear, but until now attempts to do this have employed finite difference based optimization schemes which are computationally inefficient and can suffer from false convergence. This paper develops the first computationally efficient method for direct GAM smoothness selection. It is highly stable, but by careful structuring achieves a computational efficiency that leads, in simulations, to lower mean computation times than the schemes based on working-model smoothness selection. The method also offers a reliable way of fitting generalized additive mixed models.
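
    The idea of optimizing a 'whole model' criterion directly can be caricatured in the simplest setting: a Gaussian penalized smoother whose single smoothing parameter is chosen by numerically minimizing GCV. The paper's actual method covers full GAMs with non-Gaussian responses and multiple penalties; the sketch below is only the Gaussian, one-penalty analogue.

        # Direct smoothness selection by minimizing whole-model GCV.
        import numpy as np
        from scipy.optimize import minimize_scalar

        rng = np.random.default_rng(0)
        n = 200
        x = np.sort(rng.uniform(0, 1, n))
        y = np.sin(3 * np.pi * x) + 0.3 * rng.standard_normal(n)

        centers = np.linspace(0, 1, 20)
        Z = np.exp(-0.5 * ((x[:, None] - centers) / 0.05) ** 2)  # spline-like basis
        P = np.eye(centers.size)                                 # simple ridge penalty

        def gcv(log_lam):
            # GCV(lambda) = n * RSS / (n - tr(A))^2 for the whole model.
            lam = np.exp(log_lam)
            A = Z @ np.linalg.solve(Z.T @ Z + lam * P, Z.T)      # influence matrix
            resid = y - A @ y
            return n * (resid @ resid) / (n - np.trace(A)) ** 2

        opt = minimize_scalar(gcv, bounds=(-10, 10), method="bounded")
        print("selected log(lambda):", round(opt.x, 2))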

    How to best threshold and validate stacked species assemblages? Community optimisation might hold the answer

    Get PDF
    1. The popularity of species distribution models (SDMs), and of the associated stacked species distribution models (S-SDMs), as tools for community ecologists has increased greatly in recent years. However, while some consensus has been reached about the best methods to threshold and evaluate individual SDMs, little agreement exists on how best to assemble individual SDMs into communities, i.e. how to build and assess S-SDM predictions.
    2. Here, we used published data on insects and plants collected within the same study region to test (1) whether the thresholding methods best established for optimising single-species predictions are also the best choice for predicting species assemblage composition, or whether community-based thresholding is a better alternative, and (2) whether the optimal thresholding method depends on taxa, prevalence distribution and/or species richness. Based on a comparison of different evaluation approaches, we provide guidelines for a robust community cross-validation framework to use when spatially or temporally independent data are unavailable.
    3. Our results showed that the selection of the "optimal" assembly strategy depends mostly on the evaluation approach rather than on taxa, prevalence distribution, regional species pool or species richness. When evaluated with independent data or reliable cross-validation, community-based thresholding appears superior to single-species optimisation. However, many published studies did not evaluate community projections with independent data, often leading to overoptimistic community evaluation metrics based on single-species optimisation.
    4. The fact that most of the reviewed S-SDM studies reported over-fitted community evaluation metrics highlights the importance of developing clear evaluation guidelines for community models. Here, we take a first step in this direction by providing a framework for cross-validation at the community level.
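
    A toy numerical contrast between the two assembly strategies discussed above: per-species TSS-optimal thresholds versus a single community-level threshold chosen to minimize site-level richness error. All probabilities and "observed" presences below are synthetic, so the numbers only illustrate the mechanics, not the paper's results.

        # Per-species TSS thresholds vs one community-level threshold.
        import numpy as np

        rng = np.random.default_rng(0)
        n_sites, n_species = 400, 30
        prob = rng.beta(2, 5, (n_sites, n_species))      # stand-in SDM probabilities
        truth = rng.random((n_sites, n_species)) < prob  # presences consistent with them

        grid = np.linspace(0.01, 0.99, 99)

        def tss(pred, obs):
            # True skill statistic = sensitivity + specificity - 1.
            tp, fn = (pred & obs).sum(), (~pred & obs).sum()
            tn, fp = (~pred & ~obs).sum(), (pred & ~obs).sum()
            return tp / (tp + fn) + tn / (tn + fp) - 1

        # Strategy 1: species-by-species TSS-optimal thresholds.
        sp_thr = np.array([grid[np.argmax([tss(prob[:, j] > g, truth[:, j])
                                           for g in grid])]
                           for j in range(n_species)])
        rich_sp = (prob > sp_thr).sum(axis=1)

        # Strategy 2: one community threshold minimizing mean richness error.
        obs_rich = truth.sum(axis=1)
        err = [np.abs((prob > g).sum(axis=1) - obs_rich).mean() for g in grid]
        com_thr = grid[int(np.argmin(err))]
        rich_com = (prob > com_thr).sum(axis=1)

        print("mean |richness error|, per-species thresholds:",
              np.abs(rich_sp - obs_rich).mean().round(2))
        print("mean |richness error|, community threshold:  ",
              np.abs(rich_com - obs_rich).mean().round(2))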